In this homework, we will focus on other methods for explaining machine learning models. Specifically, we will consider:
Ceteris Paribus - a method for explaining individual predictions of a model (local explanation)
Partial dependence - a method for explaining the behavior of a model (global explanation)
We will use the same dataset as in the previous homework - the Heart dataset. The goal of this homework is to get familiar with the above methods and to understand their advantages and disadvantages.
Ceteris paribus is a Latin phrase meaning "all other things being equal". It describes a situation in which one variable is varied while all other variables are held constant. In the context of machine learning, Ceteris Paribus is a method for explaining individual predictions of a model: it shows how the prediction for a single observation would change if the value of one feature changed and all the other features stayed the same.
Another name for this method is Individual Conditional Expectation (ICE).
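To make the definition concrete, here is a minimal sketch of how a Ceteris Paribus profile can be computed by hand. The function name `ceteris_paribus_profile` and the model `toy_predict` are assumptions made up for this illustration; they are not the model or the library calls used in this homework.

```python
import numpy as np

def ceteris_paribus_profile(predict, observation, feature_index, grid):
    """Predictions for one observation when a single feature is varied over `grid`."""
    # Copy the observation once per grid value, so every other feature stays fixed.
    variants = np.tile(np.asarray(observation, dtype=float), (len(grid), 1))
    variants[:, feature_index] = grid
    return predict(variants)

def toy_predict(X):
    # Hypothetical stand-in for a fitted model (not the XGBoost model from this homework).
    return 1.0 / (1.0 + np.exp(-(0.5 * X[:, 0] - 1.0 * X[:, 1])))

patient = np.array([2.0, 1.0])          # made-up observation with two features
grid = np.linspace(0.0, 4.0, 5)         # values tried for feature 0
profile = ceteris_paribus_profile(toy_predict, patient, feature_index=0, grid=grid)
```

The profile passes through the model's actual prediction for the observation at the grid point equal to the observation's own feature value, which is a quick sanity check when plotting.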
Partial dependence is a method for explaining the behavior of a model. It is a graphical representation of the relationship between the model's prediction and a set of explanatory variables, averaged over the observed values of the other explanatory variables. It is closely related to the Ceteris Paribus method, but instead of explaining a single prediction, it gives a global view of the model: for each value of the feature, it shows the average prediction over the whole dataset, that is, the average of the Ceteris Paribus profiles.
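The averaging idea can be sketched in a few lines. This is an illustration under assumptions: `partial_dependence` and `toy_predict` are hypothetical names, and the data is made up rather than taken from the Heart dataset.

```python
import numpy as np

def partial_dependence(predict, X, feature_index, grid):
    """Average prediction over the dataset for each grid value of one feature."""
    X = np.asarray(X, dtype=float)
    averages = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature_index] = value      # same value for every row, other features untouched
        averages.append(predict(X_mod).mean())
    return np.array(averages)

def toy_predict(X):
    # Hypothetical model, chosen so the result is easy to verify by hand.
    return 2.0 * X[:, 0] + X[:, 1]

X = np.array([[0.0, 1.0], [1.0, 3.0], [2.0, 5.0]])   # made-up data
grid = np.array([0.0, 1.0, 2.0])
pd_profile = partial_dependence(toy_predict, X, feature_index=0, grid=grid)
```

For this toy model the second feature has mean 3, so the profile is simply 2 * value + 3 at each grid point.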
I will use the XGBoost model to predict heart disease for two selected patients (numbers 56 and 17), and the Ceteris Paribus method to explain the model's prediction for each of them.
The first patient, number 56, is a Male, aged 48 years, with chest pain type 0 (typical angina), resting blood pressure of 122 mm Hg, cholesterol of 222 mg/dl, fasting blood sugar of 0 (false), resting electrocardiographic results of 0 (normal), maximum heart rate achieved of 186, exercise-induced angina of 0 (no), ST depression induced by exercise relative to rest of 0.0, the slope of the peak exercise ST segment of 2, number of major vessels of 0, and thalassemia of 2 (reverse).
The model predicts that the patient has a ~0.99 probability of NOT having heart disease. The model's prediction is correct, as the patient does not have heart disease.
Suppose we want to know in detail how the model response for patient 56 would change if we changed the values of selected features. We will use the Ceteris Paribus method to analyze the model's response to changes in the following features:
age - the age of the patient - continuous variable
thalachh : maximum heart rate achieved - continuous variable
trtbps : resting blood pressure (in mm Hg) - continuous variable
chol : cholesterol in mg/dl fetched via BMI sensor - continuous variable
Based on the plot below we can say that:
for the variable age, the model response is constant in the range 34-57: the model predicts a ~0.99 probability of NOT having heart disease. For ages above 57 the response decreases, reaching a minimum of ~0.76 probability of NOT having heart disease at the age of 61. Interestingly, increasing the age beyond 61 lowers the predicted probability of having heart disease again, and the response stays constant after the age of 65. According to the model, this patient has a window of opportunity for heart disease to appear, starting at the age of 61 and ending at the age of 65.
for the variable thalachh, the model response is constant in the range 72-135: the model predicts a ~0.87 probability of NOT having heart disease. For a maximum heart rate above 135, the response shows an increasing trend with some fluctuations. For values greater than 171, the response becomes constant.
changing the value of the variable trtbps barely changes the model's response. The model predicts a ~0.99 probability of NOT having heart disease for most trtbps values. This suggests that the model is not sensitive to changes in the resting blood pressure of this patient.
when it comes to cholesterol levels, increasing this variable by just 20 units (from 222 to 242) would drop the model response from ~0.99 to ~0.90. Interestingly, increasing the cholesterol level further from 242 to 250 would decrease the probability of having heart disease from ~0.1 to ~0.05. A further increase in the cholesterol level would (in the long run) increase the probability of having heart disease.
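Readings like the ones above (where the response is lowest, and over which range it dips) can also be extracted from a profile programmatically. The sketch below is only an illustration: `describe_profile` is a hypothetical helper, the tolerance is an arbitrary choice, and the profile values are made up rather than read from the plots.

```python
import numpy as np

def describe_profile(grid, profile, tol=0.01):
    """Locate the minimum of a profile and the grid range where it dips below its maximum."""
    grid = np.asarray(grid, dtype=float)
    profile = np.asarray(profile, dtype=float)
    lowest = int(np.argmin(profile))
    # Grid points where the response is more than `tol` below its highest value.
    dipped = grid[profile < profile.max() - tol]
    window = (dipped.min(), dipped.max()) if dipped.size else None
    return {"min_at": grid[lowest], "min_value": profile[lowest], "window": window}

# Made-up profile values, for illustration only.
grid = np.array([50, 55, 60, 65, 70, 75])
profile = np.array([0.95, 0.95, 0.80, 0.70, 0.95, 0.95])
summary = describe_profile(grid, profile)
```

Here the response is the probability of NOT having heart disease, so a dip in the profile corresponds to what the text calls a window of opportunity.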
The second patient, number 17, is a Female aged 66 years, with chest pain type 3 (typical angina), resting blood pressure of 150 mm Hg (trtbps), cholesterol of 226 mg/dl (chol), fasting blood sugar of 0 (false) (fbs), resting electrocardiographic results of 0 (normal) (restecg), maximum heart rate achieved of 114 (thalachh), exercise-induced angina of 0 (no) (exng), ST depression induced by exercise relative to rest of 2.6 (oldpeak), the slope of the peak exercise ST segment of 0 (slp), number of major vessels of 2 (caa), and thalassemia of 2 (normal) (thall).
The model predicts that the patient has a ~0.019 probability of not having heart disease. The model's prediction is correct, as the patient actually does have a heart disease.
For the second patient, the results are quite different. The Ceteris Paribus analysis shows that:
for the variable age, the model response is constant in the range 34-46: the model predicts a ~0.5 probability of NOT having heart disease. In the interval 46-52 the patient has a window of opportunity to develop heart disease, with the highest probability at the age of ~47. Past the age of 47, the probability of having heart disease decreases and then stays constant between the ages of 51 and 56. A second window of opportunity appears in the interval 56-64. The probability of having heart disease is highest at the age of 61, where it reaches ~0.8. Above the age of 61 the probability of having heart disease decreases and then stays constant after the age of 64.
for the variable thalachh we see that, contrary to the first patient, the model response decreases for maximum heart rate values higher than 154, reaching a minimum of ~0.467 probability of NOT having heart disease at the value of 158. For values higher than 158, the response increases with some fluctuations until the value of 178, where it becomes constant.
the results for the variable trtbps suggest that the interval 114-142 is the window of opportunity for the patient to develop heart disease. Above a trtbps of 142, the probability of having heart disease is constant.
based on the results for the variable chol, we can see that the higher the value, the lower the probability of heart disease. This is contrary to the results for the first patient, where in the long run a higher cholesterol level meant a higher probability of heart disease. It is an example of how the Ceteris Paribus explanation depends on the observation being explained and how it can be misleading when generalized. We need to keep in mind that the CP explanation is local, so more such examples can be found. A further explanation is that the XGBoost model takes interactions between the features into account, so the effect of one feature can depend on the values of the others.
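The role of interactions can be shown with a small toy example. It is a sketch under assumptions: `toy_predict` is a hypothetical model with an explicit interaction between two features, not the XGBoost model from this homework, and the two observations are made up.

```python
import numpy as np

def toy_predict(X):
    # Hypothetical model: the effect of feature 0 flips with the sign of feature 1.
    return 1.0 / (1.0 + np.exp(-X[:, 0] * X[:, 1]))

def cp_profile(predict, observation, feature_index, grid):
    # Vary one feature of a single observation, keep the others fixed.
    variants = np.tile(np.asarray(observation, dtype=float), (len(grid), 1))
    variants[:, feature_index] = grid
    return predict(variants)

grid = np.linspace(-2.0, 2.0, 9)
profile_a = cp_profile(toy_predict, [0.0, 1.0], 0, grid)    # observation with feature 1 = +1
profile_b = cp_profile(toy_predict, [0.0, -1.0], 0, grid)   # observation with feature 1 = -1
average = (profile_a + profile_b) / 2                       # what averaging the two profiles gives
```

The two profiles move in opposite directions over the same grid, and their average is flat at 0.5. In the same way, opposite trends for two patients do not contradict each other, and an averaged view can hide both of them.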
Now we will compare two methods of explaining the model's response to changes in the selected features: Ceteris Paribus and Partial Dependence Plots (PDP). It is important to note that the CP method is local and PDP is global. The CP method is local because it explains the model's response to changes in the selected features for a single observation. The PDP method is global because it explains the model's response to changes in the selected features for the whole dataset.
The PDP plot below shows the model's response to changes in the selected features for the whole dataset. The plot shows that:
the age variable indeed determines the window of opportunity for the patient to develop heart disease. The model response falls below the mean prediction for the ages 46 - ~51 and 57 - 66, and the probability of having heart disease reaches its highest value of ~0.67 at the age of 61, similar to the results of the Ceteris Paribus analysis for patients 17 and 56.
the variable thalachh (maximum heart rate achieved) has an increasing trend: the higher the value, the higher the probability of NOT having heart disease. However, the response is constant for thalachh values in the range ~71 - 137.
the small variation of the profile for the variable trtbps suggests that, on average, the model is not sensitive to changes in resting blood pressure. This agrees with the Ceteris Paribus analysis for patient 56, but it contrasts with the analysis for patient 17, for whom the model is very sensitive to changes in resting blood pressure.
for the variable chol we see a decreasing trend: the higher the value, the lower the probability of NOT having heart disease. This agrees with the Ceteris Paribus analysis for patient 56, but it contrasts with the analysis for patient 17, for whom the trend was increasing.
Let us now investigate the PDP explanations for the Logistic Regression and XGBoost models and see if there are any differences between them. It is worth noting that the Logistic Regression model is linear (in the log-odds) and, as used here, does not take interactions between the features into account. The XGBoost model is non-linear and takes interactions between the features into account.
The plot shows many interesting things, mainly:
Logistic Regression fails to capture the windows of opportunity for the patients to develop heart disease. Being a linear model, it cannot capture the non-linear curvature of the data.
The XGBoost model captures the windows of opportunity for the patients to develop heart disease.
for the variable thalachh (maximum heart rate achieved) the Logistic Regression model shows an increasing trend over the whole range of the variable, while XGBoost shows a changing trend: mostly constant, with some increasing and decreasing periods. The overall trend is still increasing: the higher the value, the higher the probability of NOT having heart disease.
for the variable trtbps the Logistic Regression model shows a gently decreasing trend. The small slope suggests that the Logistic Regression model is not sensitive to changes in resting blood pressure. The XGBoost model shows a mostly constant trend.
for the variable chol the Logistic Regression model shows a decreasing trend, and the slope is even smaller than for the variable trtbps, suggesting that the Logistic Regression model is even less sensitive to changes in the cholesterol level. For the XGBoost model, however, the profile for chol varies much more than the profile for trtbps, suggesting that the XGBoost model is more sensitive to changes in the cholesterol level. This means that the chol variable might be less important for the Logistic Regression model than for the XGBoost model.
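The contrast between the two kinds of profiles can be reproduced on toy models. The sketch below rests on assumptions: the coefficients of `logistic_predict`, the window in `stepwise_predict`, and the random data are all made up for illustration and are not fitted to the Heart dataset. The range of a profile (maximum minus minimum) is used here as one simple, hypothetical measure of sensitivity.

```python
import numpy as np

def partial_dependence(predict, X, feature_index, grid):
    # Average prediction over the dataset for each grid value of one feature.
    averages = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature_index] = value
        averages.append(predict(X_mod).mean())
    return np.array(averages)

def logistic_predict(X):
    # Hypothetical logistic model: linear in the log-odds, no interactions.
    return 1.0 / (1.0 + np.exp(-(0.8 * X[:, 0] - 0.3 * X[:, 1])))

def stepwise_predict(X):
    # Hypothetical tree-like model: lower response inside a window of feature 0.
    in_window = (X[:, 0] > 1.0) & (X[:, 0] < 2.0)
    return np.where(in_window, 0.2, 0.9)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                 # made-up data
grid = np.linspace(0.0, 3.0, 13)

pd_logistic = partial_dependence(logistic_predict, X, 0, grid)
pd_steps = partial_dependence(stepwise_predict, X, 0, grid)

# Range of each profile as a rough sensitivity measure.
range_logistic = pd_logistic.max() - pd_logistic.min()
range_steps = pd_steps.max() - pd_steps.min()
```

The logistic profile can only move in one direction along the grid, so it cannot represent a dip followed by a recovery, while the step-wise model reproduces the window exactly.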
In this notebook, we investigated in detail the Ceteris Paribus method and Partial Dependence Plots. We explained XGBoost model predictions using CP based on two patients. Specifically, we were able to explain the model's response to changes in the selected features:
age,
thalachh - maximum heart rate achieved,
trtbps - resting blood pressure,
chol - serum cholesterol in mg/dl.
We also compared the CP and PDP explanations for the XGBoost model keeping in mind that the CP method is local and PDP is global. We were able to show similarities in model response, but also some differences.
We compared the Partial Dependence Plots for the Logistic Regression and XGBoost models. We saw that the Logistic Regression model, being linear, cannot capture the non-linear curvature of the data, while the non-linear XGBoost model can: it captured the windows of opportunity for the patients to develop heart disease. We concluded that the XGBoost model is more sensitive to changes in the cholesterol level than the Logistic Regression model.